Background and Motivation

The Boston Marathon is one of the world’s oldest and most prestigious marathons. First run in 1897, it is the world’s oldest annual marathon and one of the best-known road races. It is one of six World Marathon Majors, and one of four major events held in the United States through the years of both World Wars (the Kentucky Derby, the Rose Parade, and the Westminster Kennel Club Dog Show are the others). The race attracts Olympians and world record-holders each year, and as many as 50,000 amateur and professional runners gather to join an elite and competitive field. The Boston Marathon is unusual in that registration is not open to everyone: runners must qualify by running a specified marathon time for their age group and gender, which makes the race especially prestigious and a career-long goal for many runners. Held on Patriots’ Day each year, the race draws well over 500,000 spectators along its historic 26.2-mile course from Hopkinton into downtown Boston.

This project was motivated by several factors. Despite the fame and history of the race and its course, very little is publicly shared about how participants actually perform: the Boston Marathon makes searchable results available on its website, but reports very little about how runners fare on the course. Given that this is one of the world’s most prestigious marathons, with an unusual population of qualified runners, the results seemed like an especially interesting and useful dataset to obtain and explore. Additionally, having run the Boston Marathon five times myself, I was particularly interested in learning more about the race’s results.

I originally intended to explore four specific questions with this dataset:

1. Who are the runners of the Boston Marathon? Where do most runners come from, what are their ages, and what genders are they?
2. How fast are the runners of the Boston Marathon? What are the average finishing times for men and women, and within five-year age groups (e.g., 35-39 years old)? How do these finishing times compare to the qualifying times needed to gain entry into the race?
3. How do runners run the race? Do they typically start fast and slow down as the race progresses and they become fatigued, or do the more experienced runners who qualify for Boston pace themselves and achieve an even performance? How does this vary across age groups and genders?
4. Who doesn’t finish the race? Each year, a small number of runners (approximately 1,000) do not finish the race, an outcome typically recorded as ‘DNF’ (Did Not Finish). What types of runners make up this group, in terms of age, gender, and nationality?

The final question, however, is not answerable with the dataset I obtained: the BAA (Boston Athletic Association, the race’s organizing body) does not publish results for racers who do not post a finishing time. So, I replaced the final question with:

How do elite athletes perform in the race?

Data Preprocessing

library(plyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here
library(ggplot2)
library(tidyr)
results = read.csv("resultsfinal.csv", stringsAsFactors=FALSE)

#create divisions for age groups, based on age groups listed at http://raceday.baa.org/statistics.html
#eleven left-closed bins need twelve break points; Inf keeps runners aged 80 and over, and "55-59" was missing from the labels
results$age_group = cut(results$age, breaks = c(0, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, Inf), labels = c("18-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+"), right = FALSE, ordered_result = TRUE)

#import qualifying times and add to data
qualTimes = read.table("qualifyingTimes.txt", sep = '\t', colClasses = c("factor", "factor", "character"))
names(qualTimes) = c("age_group", "gender", "qualifying_time")

#convert gender and county to factor variables
results$gender = factor(results$gender); results$county = factor(results$county)

results = merge(results, qualTimes)

#convert times to duration objects using as.difftime() and lubridate's as.duration()

results$X5k <- as.duration(as.difftime(tim = results$X5k, format = "%T", units = "secs"))
results$X10k <- as.duration(as.difftime(tim = results$X10k, format = "%T", units = "secs"))
results$X15k <- as.duration(as.difftime(tim = results$X15k, format = "%T", units = "secs"))
results$X20k <- as.duration(as.difftime(tim = results$X20k, format = "%T", units = "secs"))
results$half <- as.duration(as.difftime(tim = results$half, format = "%T", units = "secs"))
results$X25k <- as.duration(as.difftime(tim = results$X25k, format = "%T", units = "secs"))
results$X30k <- as.duration(as.difftime(tim = results$X30k, format = "%T", units = "secs"))
results$X35k <- as.duration(as.difftime(tim = results$X35k, format = "%T", units = "secs"))
results$X40k <- as.duration(as.difftime(tim = results$X40k, format = "%T", units = "secs"))
results$pace <- as.duration(as.difftime(tim = results$pace, format = "%T", units = "secs"))
results$official_time <- as.duration(as.difftime(tim = results$official_time, format = "%T", units = "secs"))
results$qualifying_time <- as.duration(as.difftime(tim = results$qualifying_time, format = "%T", units = "secs"))

#results$X5k <- as.POSIXlt(results$X5k, format = "%T")
#results$X10k <- as.POSIXlt(results$X10k, format = "%T")
#results$X15k <- as.POSIXlt(results$X15k, format = "%T")
#results$X20k <- as.POSIXlt(results$X20k, format = "%T")
#results$half <- as.POSIXlt(results$half, format = "%T")
#results$X25k <- as.POSIXlt(results$X25k, format = "%T")
#results$X30k <- as.POSIXlt(results$X30k, format = "%T")
#results$X35k <- as.POSIXlt(results$X35k, format = "%T")
#results$X40k <- as.POSIXlt(results$X40k, format = "%T")
#results$pace <- as.POSIXlt(results$pace, format = "%T")
#results$official_time <- as.POSIXlt(results$official_time, format = "%T")

#exclude any rows with NA for any value; this removes 194 runners who didn't cross various checkpoints for any reason (dropped out, missed checkpoint, foot RFID tag error...)
results = na.omit(results)
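
The binning and time-parsing steps above can be checked on a few made-up values before running them on the full results. The sketch below is a minimal base-R illustration under stated assumptions: the ages and time strings are invented, and `hms_to_seconds` is a hypothetical helper (the analysis itself combines `as.difftime()` with lubridate's `as.duration()`).

```r
# Toy values (invented for illustration; not taken from resultsfinal.csv)
ages  <- c(18, 34, 35, 57, 62, 80, 83)
times <- c("0:16:05", "2:09:17", "3:45:00", NA)

# Eleven left-closed bins need twelve break points; Inf keeps runners aged 80+
breaks <- c(0, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, Inf)
labels <- c("18-34", "35-39", "40-44", "45-49", "50-54", "55-59",
            "60-64", "65-69", "70-74", "75-79", "80+")
age_group <- cut(ages, breaks = breaks, labels = labels,
                 right = FALSE, ordered_result = TRUE)

# Hypothetical helper: convert "H:MM:SS" strings to elapsed seconds
hms_to_seconds <- function(x) {
  as.numeric(as.difftime(x, format = "%H:%M:%S", units = "secs"))
}
secs <- hms_to_seconds(times)

print(data.frame(age = ages, age_group = age_group))
print(secs)
```

With `right = FALSE`, a 35-year-old falls in "35-39" rather than "18-34". Unparseable or missing time strings come back as `NA`, which is what `na.omit()` later removes.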

Analysis: Four Guiding Questions

1. Who are the runners of the Boston Marathon?

Where do most runners come from, what are their ages, what genders are they?

countrycounts = data.frame(table(results$county))
names(countrycounts) <- c("Country", "Entrants")
ggplot(countrycounts, aes(x = reorder(Country, -Entrants), y = Entrants)) + geom_bar(stat="identity") + xlab("Country") + ggtitle("Entrants By Country") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#TODO: print top 10 sorted

#TODO: print countries with <5 entries just for fun?

#TODO: filled country map
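
The first two TODO items could be approached along the following lines. This is only a sketch on invented country codes, using base R's `table()`, `sort()`, and `head()`; the `threshold` variable is an assumed name, and the real analysis would use the country column of `results` instead of the toy vector.

```r
# Toy country codes (invented; the real values live in the results data frame)
country <- c("USA", "USA", "CAN", "USA", "GBR", "CAN", "MEX", "USA")

# Count entrants per country and sort from most to fewest
counts <- sort(table(country), decreasing = TRUE)

# Keep the n most common countries (n = 10 for the full data set, 3 here)
top_n_countries <- head(counts, 3)
print(top_n_countries)

# Countries with fewer than `threshold` entrants
threshold <- 2
rare <- names(counts)[counts < threshold]
print(rare)
```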

Individually visualized, the age distributions of female (top) and male (bottom) runners are shown below:

ggplot(results, aes(x=age)) + geom_histogram(fill = "dodgerblue", colour="black", alpha = 0.3) + facet_grid(gender ~ .)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

We can also directly compare the age distributions of male and female runners by overlaying them:

ggplot(results, aes(x=age, fill = gender)) + geom_density(alpha = 0.3)

As seen above, females have a relatively higher proportion of runners younger than approximately 45 years of age, and males have a relatively higher proportion of runners older than 45 (although, in general, the distributions are quite similar).
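
A visual comparison like this can be backed up with a simple table of proportions. The following sketch uses invented ages and genders (not the race data) to show one way of computing the share of each gender that is younger than 45 with `table()` and `prop.table()`; the cut-off of 45 is taken from the observation above.

```r
# Toy data (invented): gender and age for eight runners
runners <- data.frame(
  gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  age    = c(28, 33, 41, 52, 30, 47, 55, 61)
)

# Cross-tabulate gender against an under-45 indicator
under_45 <- runners$age < 45
tab <- table(gender = runners$gender, under_45 = under_45)

# Row-wise proportions: share of each gender that is younger than 45
share <- prop.table(tab, margin = 1)
print(round(share, 2))
```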

2. How fast are the runners of the Boston Marathon?

What are the average finishing times for men and women, and within age groups? How do these finishing times compare to the qualifying times needed to gain entry into the race?

ggplot(results, aes(x=age_group, y=official_time)) + geom_violin(fill = "dodgerblue", alpha = 0.3) + ggtitle("Official Finishing Times By Age Group")

ggplot(results, aes(x=age_group, y=official_time)) + geom_boxplot(fill = "dodgerblue", alpha = 0.3) + ggtitle("Official Finishing Times By Age Group")

ggplot(results, aes(x=gender, y=official_time)) + geom_violin(fill = "dodgerblue", alpha = 0.3) + ggtitle("Official Finishing Times By Gender")

ggplot(results, aes(x=gender, y=official_time)) + geom_boxplot(fill = "dodgerblue", alpha = 0.3) + ggtitle("Official Finishing Times By Gender")

#faceted plots with qualifying times for men, women
ggplot(subset(results, gender == "M"), aes(x=gender, y=official_time)) + geom_violin(fill = "dodgerblue", alpha = 0.5) + ggtitle("Official Finishing Times By Age Group (Men)\n Relative to Qualifying Standard")  + geom_hline(aes(yintercept=qualifying_time), size=1, colour = "gold") + facet_grid(. ~ age_group) + theme(axis.text.x = element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank()) + scale_y_continuous(breaks=c(3600, 7200, 10800, 14400, 18000, 21600), labels=c("1:00:00","2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00"))

ggplot(subset(results, gender == "F"), aes(x=gender, y=official_time)) + geom_violin(fill = "dodgerblue", alpha = 0.5) + ggtitle("Official Finishing Times By Age Group (Women)\n Relative to Qualifying Standard")  + geom_hline(aes(yintercept=qualifying_time), size=1, colour = "gold") + facet_grid(. ~ age_group) + theme(axis.text.x = element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank()) + scale_y_continuous(breaks=c(3600, 7200, 10800, 14400, 18000, 21600), labels=c("1:00:00","2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00"))

#TODO: split violin plots for male/female, within each age group --  see http://stackoverflow.com/questions/35717353/split-violin-plot-with-ggplot2
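
The plots above answer the question visually; the same comparison can be summarised numerically. The sketch below uses invented finishing and qualifying times in seconds (not the actual BAA standards or race results) and base R's `aggregate()` to compute the mean finishing time per gender and age group, together with the share of runners who finished faster than their qualifying standard. The column name `beat_standard` is an assumption introduced for illustration.

```r
# Toy data (invented): finishing and qualifying times in seconds
toy <- data.frame(
  gender          = c("F", "F", "M", "M", "M"),
  age_group       = c("18-34", "18-34", "18-34", "40-44", "40-44"),
  official_time   = c(12600, 13500, 10800, 11400, 12000),
  qualifying_time = c(12900, 12900, 11100, 11700, 11700)
)

# Mean finishing time for each gender and age group
means <- aggregate(official_time ~ gender + age_group, data = toy, FUN = mean)

# Indicator: did the runner finish faster than the qualifying standard?
toy$beat_standard <- toy$official_time < toy$qualifying_time
shares <- aggregate(beat_standard ~ gender + age_group, data = toy, FUN = mean)

# Combine both summaries into one table
summary_table <- merge(means, shares)
print(summary_table)
```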

3. How do runners run the race?

Do they typically start fast, and slow down as the race progresses and they become fatigued, or do the more experienced runners that qualify for Boston pace themselves and achieve an even performance? How does this vary across age groups and genders?

#create variable for second half
results$sec_half = results$official_time - results$half


half_times <- results[,c("bib_number", "gender", "age_group", "half", "sec_half")] %>%
  gather(half, time, half:sec_half)
 
half_times$half = factor(ifelse(half_times$half == "half", 1, 2))
half_times$time = as.numeric(half_times$time)

ggplot(half_times, aes(x = half, y = time)) + geom_violin(fill = "dodgerblue", alpha = 0.5)

#Main plot--TODO: adjust color
ggplot(half_times, aes(x = half, y = time)) + geom_violin(aes(fill = half, colour = half), alpha = 0.5) + facet_grid(gender ~ age_group) + stat_summary(fun.y=median, geom="point", fill="white", shape=21, size=2.5)

ggplot(half_times, aes(x = half, y = time)) + geom_jitter(aes(colour = gender), alpha = 0.5)
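
The first-half versus second-half comparison can also be reduced to a single number per runner. As a minimal sketch on invented times in seconds, the code below derives the second-half time the same way as `sec_half` above and then computes the share of runners with a "negative split" (second half faster than the first); `split_diff` and `negative_split_share` are names assumed for illustration.

```r
# Toy data (invented): half-marathon and finishing times in seconds
toy <- data.frame(
  bib_number    = 1:4,
  half          = c(5400, 6000, 6300, 7200),
  official_time = c(10700, 12300, 12600, 15300)
)

# Second-half time and the split difference (positive = slowed down)
toy$sec_half   <- toy$official_time - toy$half
toy$split_diff <- toy$sec_half - toy$half

# Share of runners who ran the second half faster than the first
negative_split_share <- mean(toy$split_diff < 0)

print(toy)
print(negative_split_share)
```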

#plot difference between the 35-40k segment and the first 5k (positive = slower late in the race)
ggplot(results, aes(x=age_group, y=X40k - X35k - X5k)) + geom_violin(fill = "dodgerblue", alpha = 0.3) + stat_summary(fun.y=median, geom="point", fill="gold", shape=21, size=2.5)

ggplot(results, aes(x=age_group, y=X40k - X35k - X5k)) + geom_jitter(colour = "dodgerblue", alpha = 0.3) + scale_y_continuous(limits = c(-1800, 1800), breaks = c(-1800, -1200, -600, 0, 600, 1200, 1800), labels = c("-30:00", "-20:00", "-10:00", "00:00", "+10:00", "+20:00", "+30:00")) + ylab("Difference (mm:ss)") + ggtitle("Difference between first and final 5k times\nby age group")
## Warning: Removed 31 rows containing missing values (geom_point).

#Data processing for pace calculation along various 5k segments
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
segment_5k_times <- results %>%
    mutate(segment_5k = X5k, segment_10k = X10k - X5k, segment_15k = X15k-X10k, segment_20k = X20k-X15k, segment_25k = X25k-X20k, segment_30k = X30k-X25k, segment_35k=X35k-X30k, segment_40k = X40k-X35k) %>%
    gather(km_segment, time, segment_5k:segment_40k) %>%
    mutate(km_segment = as.factor(as.numeric(gsub("k", "", gsub("segment_", "", km_segment))))) %>%
    subset(select = -c(name, age, bib_number, city, origin, X5k:sec_half))
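
The segment times built above are differences between consecutive cumulative checkpoint times. The sketch below shows the same idea for a single invented runner using base R's `diff()`; the checkpoint values are made up, and `fade` is an assumed name for the difference between the 35-40k segment and the opening 5k.

```r
# Toy data (invented): cumulative checkpoint times in seconds for one runner
checkpoints <- c(X5k = 1200, X10k = 2410, X15k = 3630, X20k = 4870,
                 X25k = 6130, X30k = 7420, X35k = 8750, X40k = 10100)

# Segment time = difference between consecutive cumulative times;
# prepending 0 makes the first segment equal to the 5k time itself
segments <- diff(c(0, checkpoints))
names(segments) <- paste0("segment_", seq(5, 40, by = 5), "k")

# Late-race fade: 35-40k segment minus the opening 5k (positive = slower)
fade <- segments[["segment_40k"]] - segments[["segment_5k"]]

print(segments)
print(fade)
```

Because every segment covers the same distance, the segment times are directly comparable as a measure of pace.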

ggplot(filter(segment_5k_times, age_group <= "45-49"), aes(x = km_segment, y = time)) + geom_jitter(colour = "dodgerblue", alpha = 0.3)+ geom_boxplot(fill = "white", width = 0.2, outlier.shape = NA) + facet_grid(age_group ~ gender) + xlab("5k segment") + ylab("Segment Time") + scale_y_continuous(limits = c(900, 2700), breaks = c(1200,  1800,  2400), labels = c("20:00", "30:00", "40:00")) + ggtitle("Performance over 5k race segments by gender and age group\n2015 Boston Marathon")
## Warning: Removed rows containing non-finite values (stat_boxplot).
## Warning: Removed rows containing missing values (geom_point).

ggplot(filter(segment_5k_times, age_group > "45-49"), aes(x = km_segment, y = time)) + geom_jitter(colour = "dodgerblue", alpha = 0.3)+ geom_boxplot(fill = "white", width = 0.2, outlier.shape = NA) + facet_grid(age_group ~ gender) + xlab("5k segment") + ylab("Segment Time") + scale_y_continuous(limits = c(900, 2700), breaks = c(1200,  1800,  2400), labels = c("20:00", "30:00", "40:00")) + ggtitle("Performance over 5k race segments by gender and age group\n2015 Boston Marathon")
## Warning: Removed rows containing non-finite values (stat_boxplot).
## Warning: Removed rows containing missing values (geom_point).

4. How do elite athletes perform in the race?

The original fourth question (who doesn’t finish the race?) cannot be answered with this dataset, because the BAA does not publish results for runners who do not post a finishing time. It was therefore replaced with the question of how elite athletes perform in the race.

Appendix: Extra Plots and Exploratory Analysis